The script below is a complete data analysis of the 2016 US presidential election. Your job is to (i) understand what the code is doing, and (ii) convert the script into an R Markdown document.
# County-level Data for 2016 Presidential Elections
# This is an analysis of county-level data from the 2016 US presidential election.
# Since only a small percentage of votes went to independent candidates, we will only
# compare the Democrat and Republican vote shares.
# The data for this analysis is taken from
# https://github.com/tonmcg/County_Level_Election_Results_12-16.
#####
# DATA IMPORT AND CHECKING
#####
library(tidyverse)
library(knitr)
df <- read_csv("http://web.stanford.edu/~kjytay/courses/stats32-aut2019/Session%208/2016_US_County_Level_Presidential_Results.csv")
head(df)
# check no. of rows: matches no. of counties in US
# (Source: http://www.snopes.com/trump-won-3084-of-3141-counties-clinton-won-57/ and http://www.wnd.com/2016/12/trumps-landslide-2623-to-489-among-u-s-counties/)
nrow(df)
# dataset columns
# - `per_dem` and `per_gop` refer to the percentage of votes going to Democrats
# and Republicans respectively.
# - `diff` represents the absolute difference between Republican and Democrat votes.
# - `per_point_diff` represents this difference as a percentage of total votes.
# - `combined_fips` is a 5-digit code identifying the county.
# (From https://en.wikipedia.org/wiki/FIPS_county_code: The FIPS county code is
# a five-digit Federal Information Processing Standards (FIPS) code (FIPS 6-4)
# which uniquely identifies counties and county equivalents in the United States,
# certain U.S. possessions, and certain freely associated states.)
names(df)
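# A hedged aside, not part of the original analysis: read_csv() will most
# likely parse `combined_fips` as a number, which drops the leading zero of
# codes such as 01001. `pad_fips` is a hypothetical helper that restores the
# 5-digit form; it assumes the codes are whole numbers.
pad_fips <- function(fips) {
  # "%05d" left-pads with zeros to a width of 5
  sprintf("%05d", as.integer(fips))
}
# `df` is the data frame loaded above; the guard lets this chunk be sourced on its own
if (is.data.frame(df)) head(pad_fips(df$combined_fips))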
# Since we are interested in whether a given county had more Republican or
# Democrat votes, we have to recompute the `diff` and `per_point_diff` columns.
# `diff` and `per_point_diff` will be positive if there are more Republican votes
# than Democrat votes (and vice versa).
df <- df %>% mutate(diff = votes_gop - votes_dem,
                    per_point_diff = diff / total_votes * 100)
#####
# SUMMARY STATISTICS
#####
# percentage of popular vote won by each party: Clinton actually wins!
paste0("Republican % of popular vote: ",
       round(sum(df$votes_gop) / sum(df$total_votes) * 100, digits = 1),
       "%")
paste0("Democrat % of popular vote: ",
       round(sum(df$votes_dem) / sum(df$total_votes) * 100, digits = 1),
       "%")
# number of counties won by each party: Trump wins by a lot
df %>% summarize(gop_won = sum(votes_gop > votes_dem),
                 dem_won = sum(votes_dem > votes_gop))
# hypothesis 1: Margin of victory was slimmer in the counties that Trump won
# compared with the counties that Clinton won.
# hypothesis 2: Clinton won in counties with large populations
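# A hedged sketch, not part of the original analysis: a quick numeric look at
# hypothesis 2, using total votes cast as a rough stand-in for population.
# `summarize_by_winner` is a hypothetical helper; it assumes the columns
# `votes_gop`, `votes_dem` and `total_votes`, and treats a tied county as a
# Democrat win (an arbitrary choice).
summarize_by_winner <- function(data) {
  winner <- ifelse(data$votes_gop > data$votes_dem, "Republican", "Democrat")
  data.frame(
    winner = sort(unique(winner)),
    # table() and tapply() both order groups alphabetically, matching `winner`
    counties = as.vector(table(winner)),
    median_total_votes = as.vector(tapply(data$total_votes, winner, median))
  )
}
# `df` is the data frame loaded above; the guard lets this chunk be sourced on its own
if (is.data.frame(df)) summarize_by_winner(df)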
#####
# HISTOGRAMS
#####
# Test hypothesis 1: histogram of the `per_point_diff`
# The histogram does not support it: Trump's margins of victory were not narrower
ggplot() +
    geom_histogram(data = df, mapping = aes(x = per_point_diff)) +
    labs(title = "Histogram of % vote margin",
         x = "% Republicans won by", y = "Frequency")
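# A hedged numeric companion to the histogram, not part of the original
# analysis: the median size of the percentage margin in the counties each
# party won. `median_margin_by_winner` is a hypothetical helper; it assumes the
# columns `votes_gop`, `votes_dem` and `total_votes`, and treats a tied county
# as a Democrat win (an arbitrary choice).
median_margin_by_winner <- function(data) {
  margin <- (data$votes_gop - data$votes_dem) / data$total_votes * 100
  winner <- ifelse(margin > 0, "Republican", "Democrat")
  # median absolute margin within each group of counties
  tapply(abs(margin), winner, median)
}
# `df` is the data frame loaded above; the guard lets this chunk be sourced on its own
if (is.data.frame(df)) median_margin_by_winner(df)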
# Test hypothesis 2: histogram of `diff`
# Completely different picture! Note the x-axis scale
ggplot() +
    geom_histogram(data = df, mapping = aes(x = diff)) +
    labs(title = "Histogram of absolute vote margin",
         x = "No. of votes Republicans won by", y = "Frequency")
# show top 50 counties with largest absolute vote difference
# top 45 counties with largest absolute vote difference were all won by Clinton
# number 46 was Montgomery, TX, which went to Trump
df %>% select(State = state_abbr, County = county_name, diff) %>%
    mutate(abs_diff = abs(diff)) %>%
    arrange(desc(abs_diff)) %>%
    select(State, County, `Vote difference` = diff) %>%
    head(n = 50) %>%
    kable()
#####
# CONCLUSION
#####
# When analyzing elections, we have to examine the data from many different
# perspectives in order to get the full story.